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ABSTRACT 

Subspace iteration is a reliable and cost effective method for solving positive definite 
banded symmetric generalized eigenproblems, especial^ in the case of large scale problems. 
This paper discusses an algorithm that makes use of two parallel banded solvers in subspace 
iteration. A shift is introduced to decompose the banded linear systems into relatively 
independent subsystems and to accelerate the iterations. With this shift, an eigenproblem 
is mapped efficiently into the memories of a multiprocessor and a high speed-up is obtained 
for parallel implementations. An optimal shift is a shift that balances total computation and 
communication costs. Under certain conditions, we show how to estimate an optimal shift 
analytically using the decay rate for the inverse of a banded matrix, and how to improve this 
estimate. Computational results on iPSC/2 and iPSC/860 multiprocessors are presented. 
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1 Introduction 


Eigenvalue problems arise in many areas of physics and engineering such as the analysis of 
electron orbits in atoms and the stability of structures. Due to the large number of applica- 
tions of such problems, there is a constant demand for algorithms for computing eigenvalues. 
The development of efficient algorithms has received considerable attention in the litera- 
ture [1] [2] [23] [29]. With the increased use of advanced computers, parallel algorithms 
are also becoming available [11] [17] [19] [20] [22] [24] [28]. Most of these algorithms begin 
by reduction of the problem to a standard form. This is particularly true for generalized 
eigenproblems [5]. 

This paper presents two parallel banded linear solvers and their application for general- 
ized positive definite eigenvalue problems, in which, only a few of the smallest eigenvalues 
and corresponding eigenvectors are needed to moderate accuracy. This type of problem arises 
in structural analysis and other engineering fields [2] [15]. Among all solution methods, two 
families are most popular: subspace iteration [2] [4] [25] [26] and the Lanczos method [12] 
[16]. The Lanczos method has been shown to be superior to subspace iteration on sequential 
and vector machines [15] [21]. The comparison of the two families on a parallel computer 
is relatively unknown. Though the Lanczos method is strongly favored by mathematicians, 
subspace iteration is more often used in engineering, particularly in structural analysis. This 
may be because subspace iteration is conceptually simple and closely associated with sub- 
structure, based on which, one can often construct an approximate subspace from experience. 
Many software codes are still using subspace iteration for computing a few dominant eigen- 
values and corresponding eigenvectors. The major effort in subspace iteration is devoted to 
solving banded linear systems. This paper discusses an algorithm that makes use of two 
parallel banded linear solvers in subspace iteration. The main feature of this algorithm is 
the introduction of a shift that decomposes the banded linear system into relatively inde- 
pendent subsystems and accelerates the subspace iteration. We shall show when this shift 
is applicable, how to estimate this shift analytically using the decay rate for the inverse of 
a banded matrix, and how to improve this estimate. The comparison of subspace iteration 
with the Lanczos method is beyond the consideration of this paper. 

This paper is organized as follows. In Section 2 two parallel tridiagonal solvers [27] are 
extended to banded solvers. The second of these makes use of the decay of the inverse 
of banded matrices. Theoretical and numerical results are presented for the comparison 
of efficiencies. In Section 3, the parallel subspace iteration algorithm for the generalized 
eigenvalue problem is described. A shift is introduced and analyzed, and computational 
results are presented. Section 4 concludes by pointing out advantages and limitations of our 
algorithm. 

2 Parallel banded solvers 

The Parallel Partition LIJ (PPT) algorithm and the Parallel Diagonal Dominant (PDD) 
algorithm were proposed for solving tridiagonal linear systems on multicomputers in [27]. 
Here we extend them to banded linear systems. The PPT algorithm is similar to an algorithm 
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introduced by Lawrie and Sameh in [18]. The PDD algorithm is a variant of PPT which 
uses the fact that the entries in the inverse of a banded matrix decay away from the main 
diagonal. 

Let us consider a system of order n 


Ax = d 


( 1 ) 


where A is a nonsingular banded matrix with lower band width mi and upper band width m u ; 
specifically, a tJ = 0 if i — j > mi or j — i > m u . Let p denote the number of processors used, 
and for convenience assume that n = pn s and that the half band width m := max(m/, m u ) < 
n s /2. Following [18] we partition A as a block p x p matrix with blocks of order n s : 


[ A \ 

B-2 A 2 

b 3 


c 2 

a 3 


B„ 


C , ,_i 

A n 


Let A = diag [/lj, A 2 , . . . , A v \ and write A = A + A^4. The main assumption of the PPT 
and PDD algorithms is that both A and A are nonsingular. The n s x n s submatrices B{ and 
Ci have the form 


B t = 


Ri 

0 


Ci = 


0 

u 


where Ri is m; x mi upper triangular and Li is m u x m u lower triangular. 
Let V and E be block p x (p — 1) matrices of the form 



r ? 1 . i 


- F 


B 2 C 2 


G F 

v = 

b 3 ••• 

, E = 

G 




F 

- 

B p J 


G. 


with n, x (mi + m u ) blocks 

Ci = 


0 0 ‘ 

, Bi = 

Ri 

O' 

, F = 

■ 0 O' 

, G = 

■0 7 mu ' 

.0 m 

1 t 

. 0 

0. 


. 7 m , 0. 

? ^ 

.0 0 


where Ik denotes the identity matrix of order k. Set n r := (mi + m u ) (p — 1). Then V and 
E are n x n T matrices and A/l = V E T . Next, choose any n x (n — n r ) matrix H so that 
P = [E\H] is an n x n permutation matrix. Then E T E = /„ r , H T H = 7 n _„ r , E T H = 0, 
and H T E = 0, and it follows that 


P T A~ l AP = 


Z 0 

L H T A~'V I n . T 


where Z — 7„ r + E T A 1 V . Consequently, if A and A are nonsingular, so is Z. 
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Now using the Sherman-Morrison-Woodbury formula [13], the solution of equation (1) 
can be expressed as 

x = A~'d= {A + VE T y l d 

= A~ l d- A~'V(I nr + E T A- 1 V)- t E T A~ 1 d. 

Using this representation equation (1) can be solved in the following steps: 

1. Solve A(x,Y) = (d,V) 

2. Form Z = I nr + E T Y and h — E T x, then solve 

Zy = h. 

3. Calculate x = x — Yy. 


The matrix Z in Step 2 is called the reduced matrix, and has the form 



Z = 


Z 12 

fm u ^24 

1 mi Z34 

Z 43 Im u 
Z 53 


1 mi 

■^2p-2,2p-3 


■^2p-4,2p-2 

■^2p-3,2p-2 

fm u 


The matrix Z can be reformulated as a tridiagonal block matrix by the permutation of the 
(2 i — l)-th and the 2i-th column blocks. Therefore the reduced system can be treated as a 
banded system with the upper and lower band widths of m/ + m u — 1. We reiterate that the 
above matrix partitioning scheme is essentially the same as that proposed in [18]. 

We now outline the PPT algorithm. 


Parallel Partition LU (PPT) Algorithm: 

• Input: A, d 


• Output: x 

• In parallel, do on all processors Pi, i = 1, . . . ,p: 


1. Solve Ai(x{,Yi) = (d{, Vi), where V = [Vi, . . . ,V P ] T , Y = [Y^, . . . ,Y P ] T , with V, 
and Yi n s x n r blocks, and d = [d\, . . . ,d p ] T , x = \x\, . . . ,x p ] with d, and ii 
^-vectors. Stop if A, is algorithmically singular. 

2. Following an all-to-all communication, form the reduced system Zy = h. 
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3. Solve the reduced system. 

4. Calculate the i — th block of the solution x: X{ = — Y t y. 


The total number of data transferred is (m/ + m u ) (mi + m u + l)(p— 1) (see [27]). The 
number of flops is roughly (8 ram 2 + 5 nm)jp + 16 (p — l)m 3 with m — max(m/,m u ), the half 
band width. 

As indicated in [27], the PPT algorithm performs well for narrow banded systems when 
the number of processors is small (say p < 16). However, the efficiency of the algorithm 
decreases quickly as either the number of processors p or the half band width m increases, 
since Steps 2 and 3 of the algorithm are a bottleneck in both computation and data com- 
munication. This is true for all known variants of the matrix tearing technique used here [9] 
[10]. Under certain conditions this bottleneck can be removed. As was already mentioned 
above, the entries of the inverse of a nonsingular, banded matrix decay away from the main 
diagonal (see [6]). When the decay in the inverse of A is sufficiently rapid, the reduced 
matrix Z can be approximated by a block diagonal matrix Z with blocks of order mi + m u . 
The PDD algorithm is the PPT algorithm with Z replaced by Z, When the PDD algorithm 
can be used, it is much more efficient than the PPT algorithm. 

We now analyze the conditions under which the PDD algorithm is applicable. For sim- 
plicity, we assume that the half band width m == mi ~ m u . We will use the notation 
A ( i i : i 2 , j\ : j 2 ) to denote the (i 2 — 2*1 + 1) x (j 2 — j\ + 1) submatrix of A with upper left 
corner at entry (iuj\) and lower right corner at entry (i 2 ,j 2 ). Also, let || || denote the 

co-norm. 

For i — 1 , ... , p — 2 the off-diagonal blocks Z 2 {+ \ y2 {-\ and Z 2 i; 2 {+ 2 are obtained from 


and 


and lienee 


Z 2% ,2i+2 = A+ 1 0 : rn,n a - rn +1 : n 5 ) L i+1 

^2<+i,2«-i = /»f+\ ( n * ~ m + 1 : n *i 1 : m ) ^t+i, 


\\Zn- 2 i+ 2 \\ < |M i+ \ (1 : m,n s - m + 1 : n s )||||L t+] || 

and 

ll^2i+i,2»-iii < iMr+i ( n « - m + 1 ; 1 ; m )iiii^«+iii- 

Using Proposition 2.3 of [6], it can be shown that there are constants C and q, C > 0 and 
0 < q < 1 , so that 

max { ||/V\ (1 : m, n s - m + 1 : n s ) ||, || (w, - m + 1 : n s , 1 : m)||| < Cq d%t,m 


where dis = n s — 2m -j- 1 is the shortest distance from the main diagonal to an entry of 
(1 : m,n s — m + 1 : n s ) or ,4“^ (n s — m + 1 : n s , 1 : m). If A is positive definite, dis 
can be replaced by 2 dis. When n s /m 1, these off-diagonal blocks can be neglected and 
the reduced matrix Z can be replaced by the block diagonal approximation Z. Our test 
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for applicability of the PDD algorithm is as follows. Let u denote machine precision (unit 
roundoff), and let c denote a small positive constant. We use PDD if 


max 1 <,-<p _ 2 { || ^2i',2t+2 ||i || ^2>+l,2t— 1 1 | } < cu ^) 

\\ z \\ 

This means that PDD is used only if the perturbation in Z caused by zeroing the off-diagonal 
blocks is comparable to the roundoff error introduced by solving the reduced system by a 
numerically stable method. Equation (2) will be referred as test (2) in the following sections. 
We now outline the PDD algorithm. 


Parallel Diagonal Dominant (PDD) Algorithm: 

• Input: m-banded A, d 

• Output: Xdd (an approximate solution of Ax = d ) 

• In parallel, do on all processors P{, i = 1, . . . ,p: 

1. Solve Ai(xi, Yj) = (d,, Vi). 

2. Send Y t (1 : to, 2m (i — 2) + 1 : 2m ( i — 2) + m) and x,(l : m) to the processor 
Pi _ i (i / 1). Form the i-th block of matrix Z (i ^ p ): 

I m Z2i-\,2i 

- Zi2i;ii—\ Jm 


3. Solve 


Pn 


Z'2i — \ ,2 


y (2m (i — 1) + 1 : 2mi) = h (2m (i — 1) m + 1 : 2 mi) 


on Pi ( i 7 ^ p), then send y (rn (2i — 1) + 1 : 2 mi) to the processor P !+] . 

4. Calculate the i-th block of x^d- (x<td)i = ii — Yiy. 


Note that the system at Step 3 can be treated as a banded system with half band 
width m if a permutation of column blocks is used. The total number of flops is roughly 
(8 nm 2 -T 5 nm)/p + 4m 3 , and the number of data transferred is m 2 T 2 rn. 

Our Hybrid Algorithm is simply: use PDD if applicable; otherwise use PPT. 

The PPT and the PDD algorithms have been implemented on a 64-node NCUBE-1 
computer for tridiagonal linear systems [27]. We present in Tables 1 and 2 the numerical 
results of the two algorithms for banded matrices on a 16-node Intel iPSC/2 computer. The 
testing banded matrices are of the form A = [a,j], with 

f 1 if 0 < |f — j| < m 

dij = < 2 rn + s if i = j 

( 0 otherwise 


Table 1: 


m 

n 

P 

Execution Time (seconds) 

Ratio 




PPT 

SPT 

PDD 

SDD 

SPT/PPT 

SDD/PDD 

PPT/PDD 

10 

640 

1 










2 

2.624 

4.811 

2.598 

4.768 

1.83 

1.84 

1.01 

(s=69) 


4 

2.224 

6.429 

1.887 

6.144 

2.89 

3.26 

1.18 



8 

1.907 

7-434 

0.922 

6.700 

3.90 

7.27 

2.07 

10 

3200 

1 










2 

13.186 

24.231 

13.160 

24.288 

1.84 

1.85 

1.00 

(s=-9) 


4 

10.092 

32.117 

9.741 

31.800 

3.18 

3.26 

1.04 



8 

5.801 

36.010 

4.816 

35.160 

6.21 

7.30 

1.20 



16 

4.643 

38.109 

2.370 

36.390 

8.21 

15.35 

1.96 

20 

”3200 

1 










2 

43.678 

78.893 

43.514 

78.979 

1.81 

1.82 

1.00 

(•=12) 


4 

34.774 

106.706 

32.629 

104.976 

3.07 

3.22 

1.07 



8 

22.408 

121.335 

16.106 

116.594 

5.41 

7.24 

1.39 



16 

22.319 


7.771 

118.908 


15.30 

2.87 


where s is chosen so that the off-diagonal blocks in the reduced matrix Z can be neglected. 
In our numerical experiments, we have used Gauss elimination with partial pivoting to solve 
the reduced system and a value c = 10 for the test in equation (2). In our experiments, the 
accuracy of the PPT^ and the PDD algorithms is comparable to working precision. 

Table 1 shows the execution time for different algorithms and the relative speedups. SPT 
and SDD are the sequential algorithms for PPT and PDD. They are executed on a single 
processor. We use them to measure the relative speedup of the PPT and PDD algorithms. 
In our tables, n is the order of the test matrix and p is the number of processors being used 
for all the algorithms except SPT and SDD. For SPT and SDD, p is the number of blocks 
in the partitioning scheme. 

As already indicated, the solution of the reduced system Zy = h is a computation and 
communication bottleneck of the PPT algorithm. Table 2 contains, for each algorithm, 
columns for the percentage of time spent on communication (comm. % ) and for the time 
spent on computation (comp. % ) at Step 2 and 3. It is clear from both theoretical and 
experimental results that the PDD algorithm is much more efficient when it is applicable. 


3 A Parallel Algorithm for the Banded Generalized 
Eigenvalue Problem 

In this section, we consider generalized symmetric eigenproblems of the form 

Ax = \Bx , (3) 

where A and B are symmetric, m-banded matrices and B is positive definite. We wish to 
approximate the q, q •< n, smallest eigenvalues Ai < A 2 < . . . < A, and corresponding 
eigenvectors x^,...,x q . This type of problem frequently arises in vibration mode analysis 
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[2] [3], [15], where A and B are, respectively, the stiffness matrix and mass matrix. The 
eigenvalues A* are the squares of the the free vibration frequencies and the eigenvectors x % 
are the corresponding mode shape vectors. 

Many sequential algorithms for solving the symmetric eigenvalue problem have been 
developed including those in [1], In recent years parallel algorithms for advanced computers 
have appeared such as those found in [11] [15] [17] [19] [20] [22] [24]. Most of these algorithms 
require reduction to tridiagonal form for an equivalent standard problem, or computation of 
all eigenpairs. The algorithm proposed in this paper is a parallel subspace iteration algorithm 
for finding the q smallest eigenvalues and corresponding eigenvectors that takes advantage of 
the assumption that A and B are banded. Subspace iteration [2] [4] [23] [25] [26] is a reliable 
and cost effective method for solving the eigenvalue problem considered here to moderate 
accuracy. It has been used extensively in a number of general purpose finite element analysis 
programs [2]. 

Using a shift $, a symmetric matrix pencil (A, £?), with B positive definite, can be 
transformed to a positive definite matrix pencil ( A — sB, B). This shift leaves eigenspaces 
unchanged. For the moment let us assume that the pencil (A,B) is positive definite. 

The basic subspace iteration algorithm then consists of the following steps: 

• Establish a starting subspace of dimension q spanned by the columns of where 
q > q and q is the number of eigenpairs to be calculated. 

• For k = 0, 1, , until the first q eigenvalues satisfy a stopping criterion. 

1. apply the Rayleigh-Ritz (RR) procedure to extract the “best” eigenvalue and 
eigenvector approximations from S^ k \ 

2. improve 5^) by an inverse iteration 

AS (k + 1} = BS {k) . 

The starting subspace can be generated as discussed in [2] or can be generated randomly. 
The user can input the value of q\ the default value is q — min{2<7, q + 8} based on the 
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numerical experiments found in [2]. The details of the RR procedure are described in the 
RR algorithm : 


RR Algorithm: 

• Input: S 

• Output: eigenpair approximations for Ax — \Bx from S 

1 . Orthonormalize the columns of S with respect to B to obtain Q written over S, 
i.e. Q T BQ = I. 

2. Form the Rayleigh Quotient: Ha = Q T AQ. 

3. Find eigenpairs of Ha' (Oj, j = 1, • • ■ , 9 . 

4. Form the “best” eigenpair approximations from S: (0j,Xj), with x } = j = 
1, •••,9- 


Let 



dHi) _ m 

12 j 

Q (k) _ 0(fc-1)’ 


j = l,..., q. 


( 4 ) 


Since limfc—oo = (Aj/A^+i) 2 , the efficiency of the subspace iteration method depends 
on A,/A^ +1 , the ratio of the largest desired eigenvalue A, and the eigenvalue A 9 - + i for the 
pencil (A,B). If this ratio is small, as in inverse iteration with a good shift, only a few 
iterations will be needed. In this case, the fact that the method is simple, does not require 
reduction to tridiagonal form, and economizes on data movement from memory, make it an 
ideal algorithm, especially for high-performance computers. We note that the most operation 
intensive parts of the algorithm are the first two steps of the RR procedure and the solution 
of the linear systems. The computation cost for the eigenpairs of H A can be ignored since 
q <C n. The RR procedure can be implemented efficiently on a variety of architectures using 
computationally primitive BLAS [ 1 ] [7]. The linear systems involved are banded and positive 
definite. 

For our parallel subspace iteration algorithm, we divide the banded pencil (A, B ) into p 
blocks, with n s rows per block, 


(A,B) 


A \ , B\ 

A'2, B 2 


A P , Bp J 


and assign one block to each processor. Our algorithm is outlined as follows. 


( 5 ) 


Parallel Subspace Iteration Algorithm: 

• Input: (A, B ), tol 
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• Output: (A,, Xi), i = 1, . . . , q, first q eigenpairs of (A, B) 


• In parallel, do on all processors Pj, j = 1, . . . ,p: 


1. Calculate concurrently a shift s, then shift (A, B) to (A s , B ) with A s = A — sB. 

2. Generate Sj , the j — th block of a starting matrix S^°\ where is an n, x q 
matrix. 

3. For k = 0, 1, . . . , 

— Apply the RR procedure concurrently. 

— Solve A 3 ,5'^ +1 ) = BS^ by the PPT or PDD algorithm until the computed 
eigenvalues of (A s , B ) satisfy 


\e\ k) -0j* -1) | 

\0W + s\ 


< tol,i = 1 


4. Set A,- := 9^ + s, x,- := x\ k \ i = 1, , 


The selection of the shift s at Step 1 is critical to the reliability and efficiency of the 
algorithm. There are, however, several competing requirements placed on the shift s. First, 
for the convergence and stability of the algorithm, the shift s should satisfy 


max I A , — si < max I A' — si, 

1 <J<q ' J ' }’>q ' 3 ' 

and s should not be an eigenvalue or too close to an eigenvalue of (A, B) [2]. Second, in 
order to reduce the number of iterations, s should be close to A j, j = 1 since the 

rate of convergence of 9^ is (Aj — s) 1 2 /(A^ + i — s) 2 . Finally, to obtain the greatest efficiency 
the shift s should be chosen, whenever possible, so that the PDD algorithm can be used to 
solve the systems A s S^ ,+1 ) = BS '*), because the PDD algorithm, among all parallel banded 
solvers, is one of the most efficient parallel algorithms. If it is not possible to find a shift 
.s so that A s satisfies test (2), then the PPT algorithm must be used to solve this system. 
Based on these requirements, we use the following process to select a shift s. 

Shift Selection Strategy: 

1. Approximate the smallest eigenvalue A] of (A, B ), then shift (A, B ) to (A,,, B ), where 
A s , = A — Si B with 

sj = Aj — 6 for a 6 > 0 

and 8 is made small compared to Aj. Since the smallest eigenvalue of (A S] , B ) is 8 , the 
pencil (A 3 ,,R) continues to be positive definite, but has a small extreme eigenvalue. 
This guarantees the convergence and stability of the iterations, and also minimizes the 
number of iterations. Find a positive lower bound for the smallest eigenvalue of A. 
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2. Compute the off-diagonal blocks of the reduced matrix for A Sl and the timing ratio for 
the PPT and PDD algorithms 

execution time of PPT , . 

<,me execution time of PDD 

The ratio r (ime is a function of the matrix parameters n , m, p, and r comp /r comTn , where p 
is the number of processors used, and f comp and r comm are the speeds for floating point 
computation and data communication. r (tme can be estimated from the complexities of 
the PPT and PDD algorithms given in Section 2 or measured by numerical experiments. 

3. If test (2) is satisfied by A Sl or r ( , me ~ 1 5 then set s := Si, and stop. 

4. If A passes test (2), apply the Shift Refinement algorithm (below) to (A S] ,B), and 
e = si, and set s 2 = t, s = S\ — s 2 , and stop. Here t denotes the output of the Shift 
Refinement algorithm. 

5. Apply test (2) to B. If B does not pass test (2), then set s 2 := 0, s Si, and stop. 

6. Attempt to compute a positive ( so that A -f- (B satisfies test (2). If such an ( is 
found, apply the Shift Refinement algorithm to (A, B) and e = £. Set s 2 := sj + 1 and 
s := S] — s 2 ; otherwise set s 2 = 0 and $ := sj. 


In the following, we give the details of Steps 1 and 6. 
Computation of Sj: 

The inverse iteration: 


Ax {k) = T (fc_ 1 ) Bx (fc-1) 


A 


(*) _ 

i ■> — 


xW T BxW 


1 , 2 ,.. 




where r ^ is a normalization constant, is used to obtain an approximation of Aj. The 
iteration {(Aj fc \ ad fc ))} converges to the first eigenpair of (A, B ) since the matrix pencil (A, B ) 
is positive definite. Few iterations are needed at this step because only a crude approximation 
of Aj is necessary. In our experiments, the linear systems Ax ^ are solved by 

the Hybrid algorithm introduced in Section 2. We note that matrix factorization is needed 
only at the first iteration. The iterations are terminated when 


- A) 

A}*> 




< 10 


-2 


( 7 ) 


and si is chosen as Si = 0.95A^, which implies that S « 0.04A!. Similarly, we find a positive 
lower bound for the smallest eigenvalue of A for use in Step 6. The quantity 6(A) required 
in Step 4 of the Shift Selection Strategy is a by-product of the inverse iteration in Step 1. 

Computation of s 2 : 

This shift is introduced only when A,, has failed to pass test (2) and r t i me > 1 • This is the 
situation when the PDD algorithm can not be applied to the systems A Sl S' {t+1) = BS W and 
the PPT is much more expensive than the PDD algorithm (most parallel banded solvers [9] 
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[10] [14] [18] have roughly similar computation and communication complexities as those of 
PPT). As indicated by the theoretical and experimental results of Section 2, r time increases 
significantly when the number of processors or the band width increases. If shift s 2 can 
be found, it can be used to remove the bottleneck of the PPT algorithm. However, while 
this shift reduces execution time for each single iteration, it increases the total number of 
iterations at the same time. Whether this shift can and should be performed depends on 
the eigenproblem (3) and the machine architecture. We now describe Step 6 . 

For a positive definite m-banded matrix M it is shown in [ 6 ] that 


< 2 ||A/ - 1 1| 2 7 ‘ >7 = l(M) 


/ \/F— l \ 2/m _ XmaxjM) 

VxA 7 +1/ ' A mtn ( M) 


where \ ma x(M) and X m i n (M ) denote the largest and smallest eigenvalues of M. If Mi is a 
diagonal block of M, then it follows that 7 (Af,) < 7 (A/). Applying test ( 2 ) to M, we find 
that 

S(M) < <7( 7 (M)) rf ” 

where C is a constant which depends on M. We will use the heuristic 


6(A + (B)*C( 1 (A + (B)f’ 


and assume that the constant C does not depend on We determine it by setting C = 
6(A)/( 7 (A)f s . 

We use inverse iteration to find 0 < c 0 < A mtn (£?) and 0 < d 0 < A„,i n (A), and we use 
Gershgorin’s Theorem to find C] — max; £" =1 |%| > A max (B) and di = max, 1 l a o l > 
Amax(A). The lower bound do was found in Step 1. In many cases B has special structure 
(such as diagonal or diagonally dominant) so that Co can be found analytically. We approxi- 
mate 7 (A), 7 (#), and j(A + (B) using r A = d 1 /d 0 , r B = ci/co, and r c = (d x +C c i)/(^o + C c o)- 
These 7 ’s and r’s are upper bounds for the exact values. If 

CWB))* = S(A)(^y‘ > cu, 

then the heuristic has failed and we set s 2 := 0 and s := sj. Otherwise, the heuristic can be 
solved for £ by setting 8(A + ( B ) = cu and solving for (: 


C = 


difcp — rdp/cp 

r -r B 


, r = 



cu\ m A 2dis ) 

~C ) 


( 8 ) 


Finally, if A -f (B does not satisfy test ( 2 ), the heuristic fails and we set s 2 := 0 and s := 5 j. 


Let A and B be partitioned as in equation (5). The following Shift Refinement Algorithm 
uses local bisection iteration to approximate the optimal shift s 2 . 

Shift Refinement Algorithm: 

• Input: Kmax , and (A, B ), e > 0 so that A + eB satisfies test ( 2 ) 

• Output: 0 < t < e so that A T tB satisfies test (2) 
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Table 3: Execution Time of the Parallel Subspace Iterations on iPSC/2. (a = 1.0) 


n 

P 

T p (seconds) 

K 

Ti/T p 

n 

P 

T p (seconds) 

K 

Ti/T p 

1200 

1 

178.867 

5 


1200 

1 

297.537 

6 



2 

92.849 

5 

1.93 


2 

158.383 

6 

1.88 

(m = 5) 

4 

48.798 

5 

3.67 

(m= 10) 

4 

83.953 

6 

3.54 

8 

26.467 

5 

6.76 


8 

44.328 

6 

6.71 


16 

15.473 

5 

11.56 


16 

24.907 

6 

11.95 

3600 

1 

443.994 

4 


3600 

' 1 

763.193 

5 



2 

225.086 

4 

1.97 


2 

396.942 

5 

1.92 

(m = 5) 

4 

1 16.258 

4 

3.82 

( m = 10) 

4 

213.064 

5 

3.58 

8 

60.248 

4 

7.37 


8 

108.022 

5 

7.07 


16 

31.807 

4 

13.96 


16 

56.136 

5 

13.60 


Table 4: Execution Time of the Parallel Subspac.e Iterations on iPSC/860. (a = 0.1) 


m 

n 

P 

T p (seconds) 

m 

n 

P 

T p (seconds) 

PDD 

PPT 

PDD 

PPT 

"lo" 

3200 

1 

69.745 

69.745 

15 

6400 

2 

* 

* 



4 

20.856 

23.090 



4 

55.472 

61.219 



8 

10.775 

14.110 



8 

27.725 

32.583 



16 

5.910 

10.744 



16 

14.342 

21.978 



32 

4.150 

11.988 



32 

7.906 

* 



64 

10.432** 

* 



64 

4.679 

* 


* Exceed memory capacity; ** Shift 52 is used. 


• In parallel, do on all processors i = 1, . . . 

1. If the corresponding off-diagonal blocks of Z for A t passed the test (2), 

set ti := 0; 
else 

initialize <{°^ •= := e/2; for k = 1, . . . , Kmax , 

(k) 

compute corresponding off-diagonal blocks of Z for Ai + t\ Bi. 

If the test (2) is satisfied, t\ k+ ^ := t\ k ^ — c/2 fc+1 ; 
otherwise set t\ k+ ^ := t\^ + t/2 k+l . 

Set ti := min{t\ k) : t\ k ) > t^ hmax+l \ k - 0, . . . , Kmax} 
endif 

2. Following a global communication, get t := max{t t , i = 1, . . ■ ,p}- 


In Tables 3-6, we present the numerical results of the Parallel Subspace Iteration algo- 
rithm on iPSC/2 and iPSC/860 multiprocessors. The program was written in Intel Fortran 
using its communication library. Some Linpack [8] and BLAS [7] Fortran source codes were 
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Table 5: Execution Time of the Parallel Subspace Iterations (n/p = 85, a = 0.1) 


p 

PDD with 52 

PPT without 52 


Tp (iPSC/2) 

T p (iPSC/860) 

K 

T p (iPSC/2) 

T p (iPSC/860) 

K 

4 

96.610 

7.391 

24 

86.101 

6.879 

19 

8 

97.173 

7.584 

24 

112.811 

8.899 

19 

16 

97.757 

7.729 

24 

166.732 

12.541 

19 

32 


7.954 

24 


20.489 

20 


used on the nodes. The testing matrices, except in Table 6, are A — B = /„, with 


( lij 


1 if 0 < \i — j\ < 77i 

< 2 7n + ai if i = j 
0 otherwise 


where cv roughly equals the separation of eigenvalues. T p is the execution time in seconds 
using p processors which includes the time spent on the shift selection, and K is the number 
of iterations. The number of required eigenpairs is q = 10. The error tolerance tol , except in 
Table 5, is chosen such that the eigenvalues are approximated to about six-digit accuracy. In 
Tables 3 and 4, excepting the case marked by **, all testing matrices A Sl passed the test (2). 
Thus, the PDD algorithm is applied directly without any extra iterations. For comparison, 
in Table 4, the results of using the PPT algorithm are presented. The efficiencies are much 
higher when the PDD algorithm is applied, especially for large problem sizes. Table 4 shows 
that if the number of processors is small, the size of a problem may exceed the memory 
capacity of the multiprocessors. More processors must be employed. However as the number 
of processors increases, the order of the resulting reduced system may become too large for a 
single processor. In this situation the PDD algorithm is the only choice for solving the linear 
systems involved even though it may cause more iterations. For the experiments presented 
in Tables 3 and 4, the order n of A and B was held constant. In Table 5 experiments are 
presented in which the order of the local submatrices is held constant at ^ = 85 and the order 
7i of A and B is scaled with the number of processors. Here, the error tolerance was also 
increased to nine-digit accuracy. The half band width is m — 10. These results demonstrate 
the parallel efficiency of the shifting strategy, especially on a large number of processors. 
For these problems, as the number of processors increases, the efficiency remains unchanged 
when s -2 is used. Without it, the performance of the algorithm deteriorates quickly. The 
parallel scalability afforded by the shifting strategy more than offsets the initial overhead in 
computing s -2 and the additional number of iterations. This seems to suggest that as long as 
the shift s 2 is relatively small and introduces a moderate number of iterations, using it with 
the PDD algorithm should be more efficient than the PPT algorithm, especially for large 
problems on a large number of processors. When the shift s 2 is too large, it can cause a 
significant increase in the number of iterations. Whether the shift ,s 2 should be performed is 
problem and machine dependent, and an a prior criterion is not available. Fortunately, once 
the shift s -2 is performed, the efficiency of the shift can be assessed after a few iterations. 

Assessing the efficiency of the shift .s 2 ; 

Assuming the iteration vector has reached its asymptotic rate of convergence, that 
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Table 6: Execution Time of the Parallel Subspace Iterations (PPT) 


n 

P 

T p (seconds) 

Tx/T p 

K 

IaI^-a^-^i 

max i<,< q . (jf). 

lAi 1 

640 

~T 

156.258 


12 

2.3 x 10 -6 


2 

78.719 

1.99 

11 

2.0 x 10 -6 


4 

40.831 

3.83 

10 

5.1 x 10“ 6 


8 

25.815 

6.05 

10 

2.7 x 10- 7 


16 

16.828 

9.29 

8 

1.4 x 10- 6 

3600 

1 

796.953 


11 

3.4 x 10" 6 


2 

422.973 

1.88 

11 

1.5 x 10~ 5 


4 

163.940 

4.86 

8 

1.2 x 10- 5 


8 

75.556 

10.55 

7 

2.3 x 10~ 5 


16 

43.429 

18.35 

7 

1.6 x 10" 6 


is 7 -W as (A,/A ? - + i) 2 (see equation (4)), the convergence rates of the two approaches (PDD 
with shift s 2 or PPT without shift s 2 ) can be estimated by r dd — r ^ and r pl = ((A^ — 

s 2 yWVxR*. — s 2 )) 2 respectively. Let tol = The numbers of iterations for these two 

approaches are roughly n dd = — d/logr dd and n pt = —d/logr pt . The timing ratio of the two 
approaches is 

Time(PPT) x n pt __ _ logr^ 

Rtime = Time(PDD) x n dd ~ rtime * log r pi ' 

If Rtime < 1, the shift s 2 should be abandoned and A Sl used on In this case, the use 

of shift s 2 may cause a large increase in the number of iterations. 

Finally we present interesting numerical results for an eigenvalue problem of a simply 
supported beam from structural mechanics. The stiffness matrix A is positive definite with 
half band width 3, while the mass matrix B is a diagonal matrix with a wide range of diagonal 
entries. The problem comes from the discretization of a differential system by central finite 
differences. Consequently, the largest eigenvalue of (A, B) is 0(n 2 ). Since 

Amar( A, 5) < A max ( A) / \ m i n (5) < di/A mln (B), 

c ~ dj >= X max {A, B) ■ Xmin(B) as 0(n 2 ) as n -> oo 

(see equation (8)), the shift s 2 would be very large if it is needed. In this example, matrix 
A fails test (2) and the shift s 2 is too large, therefore the linear solver PPT is used. Table 
6 shows the performance of the algorithm. It is interesting to note that when the number 
of processors increases, the number of iterations K decreases. Although when using the 
PPT with a large number of processors, the Parallel subspace iterations algorithm generally 
has low efficiency, here “super linear” speedups are observed. In this example, smaller 
subsystems are much better conditioned than larger ones. This may explain the decrease in 
K with increasing p. The answer awaits further investigation. 
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4 Remarks and Conclusions 


The parallel algorithm proposed here needs little more storage than its sequential version. 
Most other parallel eigensolvers require that each processor have direct access to the entire 
matrix pencil (/l, B ), while in our algorithm, the matrix pencil (A, B) and the iterative 
vectors are divided evenly into blocks, which are allocated to corresponding processors. 
Each processor operates only on its own blocks of [A, B) and most of the time. All 
processors solve identical subproblems and communicate the same amount of data at each 
step of the computation. The load is perfectly balanced. The shift s 2 reduces data references 
between processors and greatly increases the parallel efficiencies in some situations. When 
the number of processors is large, secondary memory is usually not necessary even for large 
problems. Data transfer between different levels of memory can be reduced by employing 
block matrix computations, such as BLAS [1] [7]. When the number of processors or the 
band width is large, the size of the reduced system Zy = h can become prohibitive for 
the PPT algorithm and most other banded solvers. In this situation, shift s 2 , with its 
possibly high number of iterations, seems to be the only alternative. The efficiency of 
the shift s 2 is machine dependent. Our numerical results suggested that, on iPSC/2 or 
iPSC/860 hypercube multiprocessors, this shift can lead to higher efficiencies when the ratio 
of largest to smallest eigenvalues of the pencil is 0(n), but it is detrimental when this 
ratio is 0(n 2 ) or greater, which is typically the case with problems derived from differential 
equations. However, in this case, the numerical experiments of Table 6 indicate that the 
PPT version of our algorithm exhibits “superlinear” convergence. This may occur because 
smaller submatrices are much better conditioned than larger ones. 

Many acceleration schemes have been applied to the basic sequential subspace iteration 
method [3] [26], making it efficient for a wide variety of applications. Most of these acceler- 
ating schemes can be easily incorporated into our parallel algorithm. 
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